2024-02-01
- `ols` to fit linear regression models in the presence of missing values
- `aregImpute` to facilitate principled multiple imputation when fitting regressions
- the Heart and Estrogen/Progestin Study (HERS)
hers_raw <- read_csv("c06/data/hersdata.csv", show_col_types = FALSE) |>
  clean_names() |>
  mutate_if(is.character, as.factor)

hers1 <- hers_raw |>
  filter(diabetes == "no") |>
  mutate(subject = as.character(subject)) |>
  select(subject, ldl, age, sbp, bmi, ht, smoking, drinkany,
         physact, diabetes)
dim(hers1)

[1] 2032 10
Goal: Predict ldl using age, sbp, bmi, smoking, drinkany, and physact, across both HT levels but restricted to women without diabetes.
Summary of hers1

  subject            ldl              age              sbp
Length:2032 Min. : 36.8 Min. :44.00 Min. : 83.0
Class :character 1st Qu.:120.6 1st Qu.:62.00 1st Qu.:120.0
Mode :character Median :141.4 Median :67.00 Median :132.0
Mean :145.6 Mean :66.89 Mean :133.4
3rd Qu.:166.0 3rd Qu.:72.00 3rd Qu.:145.0
Max. :351.2 Max. :79.00 Max. :197.0
NA's :7
bmi ht smoking drinkany
Min. :15.21 hormone therapy:1001 no :1733 no :1135
1st Qu.:24.20 placebo :1031 yes: 299 yes : 895
Median :26.89 NA's: 2
Mean :27.67
3rd Qu.:30.27
Max. :54.13
NA's :2
physact diabetes
about as active :674 no :2032
much less active :107 yes: 0
much more active :252
somewhat less active:322
somewhat more active:677
hers1 Codebook (n = 2032)

| Variable | Description |
|---|---|
| subject | subject code |
| ht | factor: hormone therapy or placebo |
| diabetes | yes or no (all are no in our sample) |
| ldl | LDL cholesterol in mg/dl |
| age | age in years |
| sbp | systolic BP in mm Hg |
| bmi | body-mass index in kg/m2 |
| smoking | yes or no |
| drinkany | yes or no |
| physact | 5-level factor, details next slide |
The physact variable

# A tibble: 5 × 2
physact n
<fct> <int>
1 about as active 674
2 much less active 107
3 much more active 252
4 somewhat less active 322
5 somewhat more active 677
Comparison is to activity levels for these women just before menopause.
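The counts above can be reproduced with dplyr's count (a sketch, assuming the hers1 tibble built earlier):

```r
# tabulate the levels of the physact factor
hers1 |> count(physact)
```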
Missing values in drinkany, bmi and ldl

Since drinkany is a factor, we have to do some extra work to impute.
aregImpute from Hmisc

A model to predict all missing values of any variables, using additive regression, bootstrapping, and predictive mean matching.
There are four steps.
aregImpute draws a sample with replacement from the observations where the target variable is not missing.

aregImpute requires specifications of all variables, and several other details:
- n.impute = number of imputations; we'll run 20
- nk = number of knots to describe level of complexity; with our choice nk = c(0, 3:5) we'll fit both linear models and models with restricted cubic splines with 3, 4, and 5 knots
- tlinear = FALSE allows the target variable to have a non-linear transformation when nk is 3 or more
- B = 10 specifies 10 bootstrap samples will be used
- data specifies the source of the variables
- pr = FALSE suppresses printing of iteration messages

aregImpute Imputation Results
Multiple Imputation using Bootstrap and PMM
aregImpute(formula = ~ldl + age + smoking + drinkany + sbp +
physact + bmi, data = hers1, n.impute = 20, nk = c(0, 3:5),
tlinear = FALSE, pr = FALSE, B = 10)
n: 2032 p: 7 Imputations: 20 nk: 0
Number of NAs:
ldl age smoking drinkany sbp physact bmi
7 0 0 2 0 0 2
type d.f.
ldl s 1
age s 1
smoking c 1
drinkany c 1
sbp s 1
physact c 4
bmi s 1
R-squares for Predicting Non-Missing Values for Each Variable
Using Last Imputations of Predictors
ldl drinkany bmi
0.041 0.014 0.109
Resampling results for determining the complexity of imputation models
Variable being imputed: ldl
nk=0 nk=3 nk=4 nk=5
Bootstrap bias-corrected R^2 0.0139 0.0149 0.00776 0.0124
10-fold cross-validated R^2 0.0214 0.0180 0.01517 0.0191
Bootstrap bias-corrected mean |error| 28.3594 42.9139 44.09937 39.8266
10-fold cross-validated mean |error| 145.7176 43.5007 45.02428 44.2456
Bootstrap bias-corrected median |error| 22.8301 35.5441 38.85302 32.6386
10-fold cross-validated median |error| 141.4238 36.4102 38.88053 37.3141
Variable being imputed: drinkany
nk=0 nk=3 nk=4 nk=5
Bootstrap bias-corrected R^2 0.0163 0.0113 0.0102 0.00986
10-fold cross-validated R^2 0.0205 0.0249 0.0163 0.01358
Bootstrap bias-corrected mean |error| 0.4470 0.4568 0.4558 0.46624
10-fold cross-validated mean |error| 0.4450 0.4454 0.4476 0.44676
Bootstrap bias-corrected median |error| 0.0000 0.0000 0.0000 0.00000
10-fold cross-validated median |error| 0.0000 0.0500 0.1000 0.00000
Variable being imputed: bmi
nk=0 nk=3 nk=4 nk=5
Bootstrap bias-corrected R^2 0.0845 0.0932 0.0946 0.0847
10-fold cross-validated R^2 0.0864 0.0903 0.0968 0.0899
Bootstrap bias-corrected mean |error| 3.7829 4.8119 4.9226 5.1775
10-fold cross-validated mean |error| 27.6776 4.8359 4.9390 5.1136
Bootstrap bias-corrected median |error| 2.9955 3.9704 3.9371 4.2634
10-fold cross-validated median |error| 27.0143 3.9894 3.9431 4.1876
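One way to inspect the imputed values summarized below is Hmisc's plot method for aregImpute objects; a sketch, assuming the fitted imputation object is named fit3, as it is later in these slides:

```r
# plot cumulative distributions of the imputed values
# for each variable with missing data (ldl, drinkany, bmi)
par(mfrow = c(1, 3))
plot(fit3)
par(mfrow = c(1, 1))
```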
- For ldl, we imputed most of the 7 missing subjects in most of the 20 imputation runs to values within a range of around 120 through 200, but occasionally, we imputed values that were substantially lower than 100.
- For drinkany, we imputed about 70% no and 30% yes.
- For bmi, we imputed values ranging from about 23 to 27 in many cases, and up near 40 in other cases.

The helpdat example from Class 5

[1] 453 8
[1] "id" "cesd" "age" "sex" "subst" "mcs" "pcs" "pss_fr"
We'll predict cesd using n = 453 observations and 6 candidate predictors (age, sex, subst, mcs, pcs and pss_fr).
What happens when we add a non-linear term?
Adding an interaction (product term) requires including the main effects of the predictors we are interacting.
Spearman’s \(\rho^2\) is an indicator (not a perfect one) of potential predictive punch, but doesn’t give away the game.
mcs is the most attractive candidate for a non-linear term, as it packs the most potential predictive punch, so if it does turn out to need non-linear terms, our degrees of freedom will be well spent.
- We don't yet know whether mcs actually needs a non-linear term, or will show meaningfully better results if a non-linear term is included. We'd have to fit a model with and without non-linearity in mcs to know that.
- pcs, also quantitative, has the next most potential predictive punch after mcs, then pss_fr and sex.
- With 453 observations (452 df) we should be thinking about models with modest numbers of regression inputs.
In this case, we might choose to include non-linear terms in just two or three variables (and that’s it) and even that would be tough to justify with this modest sample size.
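The Spearman \(\rho^2\) results shown next can be obtained with rms::spearman2; a sketch, assuming the helpdat tibble from Class 5:

```r
# Spearman rho^2 for each candidate predictor of cesd
spear_cesd <- spearman2(cesd ~ mcs + subst + pcs + age + sex + pss_fr,
                        data = helpdat)
spear_cesd        # printed table
plot(spear_cesd)  # dot chart, sorted by adjusted rho^2
```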
spear_cesd
Spearman rho^2    Response variable: cesd
rho2 F df1 df2 P Adjusted rho2 n
mcs 0.438 350.89 1 451 0.0000 0.436 453
subst 0.034 7.97 2 450 0.0004 0.030 453
pcs 0.089 44.22 1 451 0.0000 0.087 453
age 0.000 0.12 1 451 0.7286 -0.002 453
sex 0.033 15.56 1 451 0.0001 0.031 453
pss_fr 0.033 15.57 1 451 0.0001 0.031 453
Fit a model to predict cesd using:

- mcs, with a restricted cubic spline, 5 knots
- pcs, with a restricted cubic spline, 3 knots
- pss_fr, age, and sex
- the interaction of sex with the main effect of mcs (restricting our model so that terms that are non-linear in both sex and mcs are excluded), and
- subst

mod_help2 is definitely more than we can reasonably do with 453 observations, but let's see how it looks.
%ia% tells R to fit an interaction term with sex and the main effect of mcs.
We need sex as a main effect for the interaction term (%ia%) to work. We already have the main effect of mcs in as part of the spline.

mod_help2

Linear Regression Model
ols(formula = cesd ~ rcs(mcs, 5) + rcs(pcs, 3) + sex + mcs %ia%
sex + pss_fr + age + subst, data = helpdat, x = TRUE, y = TRUE)
Model Likelihood Discrimination
Ratio Test Indexes
Obs 453 LR chi2 349.44 R2 0.538
sigma8.6248 d.f. 12 R2 adj 0.525
d.f. 440 Pr(> chi2) 0.0000 g 10.439
Residuals
Min 1Q Median 3Q Max
-26.7893 -5.9000 0.1545 5.5884 26.1304
Coef S.E. t Pr(>|t|)
Intercept 76.3346 6.2540 12.21 <0.0001
mcs -0.9306 0.2315 -4.02 <0.0001
mcs' 1.6607 2.5040 0.66 0.5075
mcs'' -2.8854 8.3945 -0.34 0.7312
mcs''' 0.2942 7.9390 0.04 0.9705
pcs -0.2341 0.0883 -2.65 0.0083
pcs' -0.0151 0.1000 -0.15 0.8797
sex=male -2.0330 2.5456 -0.80 0.4249
mcs * sex=male -0.0129 0.0783 -0.17 0.8690
pss_fr -0.2569 0.1046 -2.46 0.0144
age -0.0466 0.0569 -0.82 0.4139
subst=cocaine -2.6999 0.9965 -2.71 0.0070
subst=heroin -2.1741 1.0677 -2.04 0.0423
Remember this ANOVA testing is sequential, other than the TOTALS.
Analysis of Variance   Response: cesd

Factor                                  d.f. Partial SS    MS           F      P
mcs (Factor+Higher Order Factors)         5  26857.364671  5371.472934  72.21  <.0001
 All Interactions                         1      2.026255     2.026255   0.03  0.8690
 Nonlinear                                3    293.502251    97.834084   1.32  0.2688
pcs                                       2   2548.388579  1274.194290  17.13  <.0001
 Nonlinear                                1      1.705031     1.705031   0.02  0.8797
sex (Factor+Higher Order Factors)         2    451.578352   225.789176   3.04  0.0491
 All Interactions                         1      2.026255     2.026255   0.03  0.8690
mcs * sex (Factor+Higher Order Factors)   1      2.026255     2.026255   0.03  0.8690
pss_fr                                    1    448.812293   448.812293   6.03  0.0144
age                                       1     49.758786    49.758786   0.67  0.4139
subst                                     2    611.625952   305.812976   4.11  0.0170
TOTAL NONLINEAR                           4    293.512204    73.378051   0.99  0.4146
TOTAL NONLINEAR + INTERACTION             5    294.601803    58.920361   0.79  0.5558
REGRESSION                               12  38058.315322  3171.526277  42.64  <.0001
ERROR                                   440  32730.174744    74.386761
validate results for mod_help2

          index.orig training    test optimism index.corrected  n
R-square 0.5376 0.5513 0.5233 0.0280 0.5096 40
MSE 72.2520 69.8358 74.4984 -4.6627 76.9147 40
g 10.4392 10.5053 10.2718 0.2335 10.2056 40
Intercept 0.0000 0.0000 0.7893 -0.7893 0.7893 40
Slope 1.0000 1.0000 0.9751 0.0249 0.9751 40
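The optimism-corrected indices above come from rms::validate with 40 bootstrap replications (the n column); a sketch, where the seed is an assumption:

```r
# bootstrap validation of mod_help2 (seed is an assumption)
set.seed(432)
validate(mod_help2, B = 40)
```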
summary results for mod_help2

Effects   Response : cesd

Factor                   Low    High   Diff.   Effect    S.E.    Lower 0.95 Upper 0.95
mcs                      21.676 40.941 19.266 -10.96400 1.23340 -13.38800  -8.539800
pcs                      40.384 56.953 16.569  -4.10790 0.73381  -5.55010  -2.665700
pss_fr                    3.000 10.000  7.000  -1.79860 0.73225  -3.23780  -0.359500
age                      30.000 40.000 10.000  -0.46552 0.56918  -1.58420   0.653130
sex - female:male         2.000  1.000     NA   2.40260 0.99054   0.45577   4.349300
subst - cocaine:alcohol   1.000  2.000     NA  -2.69990 0.99647  -4.65830  -0.741430
subst - heroin:alcohol    1.000  3.000     NA  -2.17410 1.06770  -4.27250  -0.075632
Adjusted to: mcs=28.60242 sex=male
mod_help2
n=453 Mean absolute error=0.386 Mean squared error=0.19775
0.9 Quantile of absolute error=0.704
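These error summaries are what rms prints when plotting a calibrate result; a sketch, where the seed and B are assumptions:

```r
# bootstrap calibration curve for mod_help2
set.seed(432)  # seed is an assumption
plot(calibrate(mod_help2, B = 40))
```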
Using lm for fitting complex linear models

We can certainly assess this big, complex model using lm, too:
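A sketch of that lm fit (rms must be loaded so that rcs and %ia% are available in the formula; mod_help2_lm is a hypothetical name):

```r
# same right-hand side as mod_help2, fit with lm
mod_help2_lm <- lm(cesd ~ rcs(mcs, 5) + rcs(pcs, 3) + sex +
                     mcs %ia% sex + pss_fr + age + subst,
                   data = helpdat)
summary(mod_help2_lm)
```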
lm and ols

But to really delve into the details of how well this complex model works, and to help plot what is actually being fit, we'll probably want to fit the model using ols.

Some summaries are available from lm, and others are most easily obtained using ols.

hers2 data

Analysis of Variance   Response: ldl
Factor d.f. Partial SS MS F P
age 1 9330.911 9330.911 6.93 0.0085
smoking 1 8199.755 8199.755 6.09 0.0137
drinkany 1 6444.424 6444.424 4.79 0.0288
sbp 1 9274.287 9274.287 6.89 0.0087
physact 4 10874.528 2718.632 2.02 0.0891
bmi 1 15876.957 15876.957 11.80 0.0006
REGRESSION 9 60077.708 6675.301 4.96 <.0001
ERROR 2022 2721037.890 1345.716
How should we prioritize the degrees of freedom we spend on non-linearity?
How many degrees of freedom are we spending in the hers2 data here? Why?

We're spending 9 degrees of freedom in our kitchen sink model. (We can verify this with anova or the plot.)
Suppose we’re willing to spend up to a total of 16 degrees of freedom (i.e. a combined 7 more on interaction terms and other ways to capture non-linearity.)
Group 1 (largest adjusted \(\rho^2\))

- bmi, a quantitative predictor, is furthest to the right

Group 2 (next largest)

- smoking, a binary predictor, is next, followed closely by
- age, a quantitative predictor

Other predictors (rest of the group)

- sbp, quantitative
- drinkany, binary
- physact, multi-categorical (5 levels)

Fitting a model with ols to predict ldl using:

- bmi with a restricted cubic spline, 5 knots
- age with a quadratic polynomial
- sbp as a linear term
- drinkany indicator
- physact factor
- smoking indicator and its interaction with the main effect of bmi

We can fit this to the data:
- using the complete cases (hers1, effectively)
- after single imputation (hers2)
- after multiple imputation (with fit3)

where %ia% identifies the linear interaction alone.
m1 results

Frequencies of Missing Values Due to Each Variable
ldl bmi age sbp drinkany physact smoking
7 2 0 0 2 0 0
Linear Regression Model
ols(formula = ldl ~ rcs(bmi, 5) + pol(age, 2) + sbp + drinkany +
physact + smoking + smoking %ia% bmi, data = hers1, x = TRUE,
y = TRUE)
Model Likelihood Discrimination
Ratio Test Indexes
Obs 2021 LR chi2 52.61 R2 0.026
sigma36.7430 d.f. 14 R2 adj 0.019
d.f. 2006 Pr(> chi2) 0.0000 g 6.629
Residuals
Min 1Q Median 3Q Max
-113.440 -24.519 -3.778 20.940 197.087
Coef S.E. t Pr(>|t|)
Intercept 121.6057 68.2000 1.78 0.0747
bmi 1.5687 1.0107 1.55 0.1208
bmi' -8.6685 9.1577 -0.95 0.3440
bmi'' 40.5712 37.4468 1.08 0.2787
bmi''' -55.8872 44.5946 -1.25 0.2103
age -0.5791 1.9657 -0.29 0.7683
age^2 0.0018 0.0149 0.12 0.9024
sbp 0.1221 0.0453 2.69 0.0072
drinkany=yes -3.7427 1.6629 -2.25 0.0245
physact=much less active -4.5660 3.8904 -1.17 0.2407
physact=much more active -0.3291 2.7521 -0.12 0.9048
physact=somewhat less active -0.0160 2.5270 -0.01 0.9950
physact=somewhat more active 3.7731 2.0293 1.86 0.0631
smoking=yes -7.0832 12.0586 -0.59 0.5570
smoking=yes * bmi 0.4961 0.4391 1.13 0.2587
where, again, %ia% identifies the linear interaction alone.
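These slides don't show how the singly-imputed hers2 data were built; here is a minimal sketch of one simple approach (median imputation for the quantitative variables and the modal "no" for drinkany), which may differ from the actual construction:

```r
# sketch only: a simple single imputation to create hers2
hers2 <- hers1 |>
  mutate(ldl = ifelse(is.na(ldl), median(ldl, na.rm = TRUE), ldl),
         bmi = ifelse(is.na(bmi), median(bmi, na.rm = TRUE), bmi),
         drinkany = factor(ifelse(is.na(drinkany), "no",
                                  as.character(drinkany))))
```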
m2 results

Linear Regression Model
ols(formula = ldl ~ rcs(bmi, 5) + pol(age, 2) + sbp + drinkany +
physact + smoking + smoking %ia% bmi, data = hers2, x = TRUE,
y = TRUE)
Model Likelihood Discrimination
Ratio Test Indexes
Obs 2032 LR chi2 53.14 R2 0.026
sigma36.6503 d.f. 14 R2 adj 0.019
d.f. 2017 Pr(> chi2) 0.0000 g 6.631
Residuals
Min 1Q Median 3Q Max
-113.379 -24.326 -3.835 20.832 197.097
Coef S.E. t Pr(>|t|)
Intercept 120.2662 67.6113 1.78 0.0754
bmi 1.5508 1.0071 1.54 0.1237
bmi' -8.4486 9.0978 -0.93 0.3532
bmi'' 39.6413 37.1378 1.07 0.2859
bmi''' -54.8924 44.2677 -1.24 0.2151
age -0.5249 1.9490 -0.27 0.7877
age^2 0.0014 0.0148 0.10 0.9233
sbp 0.1209 0.0451 2.68 0.0074
drinkany=yes -3.7023 1.6544 -2.24 0.0253
physact=much less active -4.7408 3.8621 -1.23 0.2198
physact=much more active -0.2635 2.7391 -0.10 0.9234
physact=somewhat less active 0.0130 2.5101 0.01 0.9959
physact=somewhat more active 3.8031 2.0193 1.88 0.0598
smoking=yes -6.8961 12.0196 -0.57 0.5662
smoking=yes * bmi 0.4892 0.4375 1.12 0.2636
ANOVA for m2 from ols

Analysis of Variance   Response: ldl

Factor                                       d.f. Partial SS    MS          F     P
bmi (Factor+Higher Order Factors)               5 2.758824e+04  5517.64861  4.11  0.0010
 All Interactions                               1 1.679813e+03  1679.81344  1.25  0.2636
 Nonlinear                                      3 9.735452e+03  3245.15068  2.42  0.0647
age                                             2 9.175762e+03  4587.88077  3.42  0.0330
 Nonlinear                                      1 1.244351e+01    12.44351  0.01  0.9233
sbp                                             1 9.657476e+03  9657.47569  7.19  0.0074
drinkany                                        1 6.726918e+03  6726.91809  5.01  0.0253
physact                                         4 9.709992e+03  2427.49791  1.81  0.1247
smoking (Factor+Higher Order Factors)           2 1.085405e+04  5427.02463  4.04  0.0177
 All Interactions                               1 1.679813e+03  1679.81344  1.25  0.2636
smoking * bmi (Factor+Higher Order Factors)     1 1.679813e+03  1679.81344  1.25  0.2636
TOTAL NONLINEAR                                 4 9.738807e+03  2434.70175  1.81  0.1237
TOTAL NONLINEAR + INTERACTION                   5 1.171134e+04  2342.26845  1.74  0.1214
REGRESSION                                     14 7.178905e+04  5127.78931  3.82  <.0001
ERROR                                        2017 2.709327e+06  1343.24569
Complete cases only…
index.orig training test optimism index.corrected n
R-square 0.0257 0.0345 0.0184 0.0161 0.0096 40
MSE 1340.0254 1324.8222 1350.0695 -25.2473 1365.2727 40
g 6.6287 7.5809 5.9177 1.6632 4.9655 40
Intercept 0.0000 0.0000 31.2738 -31.2738 31.2738 40
Slope 1.0000 1.0000 0.7863 0.2137 0.7863 40
After single imputation…
index.orig training test optimism index.corrected n
R-square 0.0258 0.0337 0.0188 0.0150 0.0108 40
MSE 1333.3300 1336.3384 1342.9706 -6.6322 1339.9622 40
g 6.6306 7.5648 5.9723 1.5924 5.0382 40
Intercept 0.0000 0.0000 30.1440 -30.1440 30.1440 40
Slope 1.0000 1.0000 0.7932 0.2068 0.7932 40
summary(m2) results

Effects   Response : ldl

Factor                                               Low   High    Diff.   Effect  S.E.   Lower 0.95 Upper 0.95
bmi                                                  24.2  30.263  6.0625  5.1862  2.2217   0.82921   9.54330
age                                                  62.0  72.000 10.0000 -3.3412  1.3450  -5.97890  -0.70357
sbp                                                 120.0 145.000 25.0000  3.0218  1.1270   0.81165   5.23190
drinkany - yes:no                                     1.0   2.000      NA -3.7023  1.6544  -6.94690  -0.45779
physact - about as active:somewhat more active        5.0   1.000      NA -3.8031  2.0193  -7.76310   0.15695
physact - much less active:somewhat more active       5.0   2.000      NA -8.5439  3.9035 -16.19900  -0.88862
physact - much more active:somewhat more active       5.0   3.000      NA -4.0666  2.7125  -9.38630   1.25310
physact - somewhat less active:somewhat more active   5.0   4.000      NA -3.7901  2.5633  -8.81720   1.23690
smoking - yes:no                                      1.0   2.000      NA  6.2635  2.4009   1.55500  10.97200

Adjusted to: bmi=26.9 smoking=no
Making predictions with m2

Suppose we want a prediction for a new individual subject with bmi = 30, age = 50, smoking = yes, physact = about as active, drinkany = yes and sbp of 150.
predict(m2, expand.grid(bmi = 30, age = 50, sbp = 150, smoking = "yes",
physact = "about as active", drinkany = "yes"),
        conf.int = 0.95, conf.type = "individual")

$linear.predictors
1
160.9399
$lower
1
88.48615
$upper
1
233.3936
This is called a prediction interval.
The other prediction we might make is for the mean of a series of subjects with the same predictor values…
predict(m2, expand.grid(bmi = 30, age = 50, sbp = 150, smoking = "yes",
physact = "about as active", drinkany = "yes"),
        conf.int = 0.95, conf.type = "mean")

$linear.predictors
1
160.9399
$lower
1
151.8119
$upper
1
170.0679
Note that the confidence interval will always be narrower than the prediction interval given the same predictor values.
What do we have now?
Our imputation model, fit3:

fit3 <- aregImpute(~ ldl + age + smoking + drinkany + sbp +
physact + bmi, nk = c(0, 3:5), tlinear = FALSE,
data = hers1, B = 10, n.impute = 20, x = TRUE)
Our outcome model (as in m1 or m2):

ols(ldl ~ rcs(bmi, 5) + pol(age, 2) + sbp +
drinkany + physact + smoking + smoking %ia% bmi,
x = TRUE, y = TRUE)
Put them together with fit.mult.impute()…
Without pr = FALSE, it generates considerable output related to the imputations, which we won't use today.

m3imp results

Linear Regression Model
fit.mult.impute(formula = ldl ~ rcs(bmi, 5) + pol(age, 2) + sbp +
drinkany + physact + smoking + smoking %ia% bmi, fitter = ols,
xtrans = fit3, data = hers1, pr = FALSE)
Model Likelihood Discrimination
Ratio Test Indexes
Obs 2032 LR chi2 52.74 R2 0.026
sigma36.7331 d.f. 14 R2 adj 0.019
d.f. 2017 Pr(> chi2) 0.0000 g 6.621
Residuals
Min 1Q Median 3Q Max
-113.345 -24.510 -3.803 20.777 197.295
Coef S.E. t Pr(>|t|)
Intercept 119.8951 67.8409 1.77 0.0773
bmi 1.5436 1.0097 1.53 0.1265
bmi' -8.3664 9.1409 -0.92 0.3602
bmi'' 39.2149 37.3458 1.05 0.2938
bmi''' -54.2873 44.5323 -1.22 0.2230
age -0.5002 1.9555 -0.26 0.7981
age^2 0.0012 0.0148 0.08 0.9351
sbp 0.1198 0.0454 2.64 0.0083
drinkany=yes -3.7196 1.6613 -2.24 0.0253
physact=much less active -4.7109 3.8716 -1.22 0.2238
physact=much more active -0.2328 2.7512 -0.08 0.9326
physact=somewhat less active -0.0417 2.5246 -0.02 0.9868
physact=somewhat more active 3.8197 2.0286 1.88 0.0599
smoking=yes -6.8967 12.0503 -0.57 0.5672
smoking=yes * bmi 0.4866 0.4389 1.11 0.2677
m3imp   Analysis of Variance   Response: ldl

Factor                                       d.f. Partial SS    MS           F     P
bmi (Factor+Higher Order Factors)               5 2.728300e+04  5456.600791  4.04  0.0012
 All Interactions                               1 1.658459e+03  1658.458931  1.23  0.2677
 Nonlinear                                      3 9.585703e+03  3195.234412  2.37  0.0690
age                                             2 9.320445e+03  4660.222299  3.45  0.0318
 Nonlinear                                      1 8.950493e+00     8.950493  0.01  0.9351
sbp                                             1 9.407603e+03  9407.602954  6.97  0.0083
drinkany                                        1 6.763854e+03  6763.853503  5.01  0.0253
physact                                         4 9.698175e+03  2424.543639  1.80  0.1268
smoking (Factor+Higher Order Factors)           2 1.031090e+04  5155.452328  3.82  0.0221
 All Interactions                               1 1.658459e+03  1658.458931  1.23  0.2677
smoking * bmi (Factor+Higher Order Factors)     1 1.658459e+03  1658.458931  1.23  0.2677
TOTAL NONLINEAR                                 4 9.587178e+03  2396.794504  1.78  0.1309
TOTAL NONLINEAR + INTERACTION                   5 1.152744e+04  2305.487432  1.71  0.1293
REGRESSION                                     14 7.030149e+04  5021.535034  3.72  <.0001
ERROR                                        2017 2.721574e+06  1349.317884
summary results for m3imp

Effects   Response : ldl

Factor                                               Low   High    Diff.   Effect  S.E.   Lower 0.95 Upper 0.95
bmi                                                  24.2  30.263  6.0625  5.1643  2.2300   0.79099   9.53750
age                                                  62.0  72.000 10.0000 -3.3824  1.3518  -6.03340  -0.73144
sbp                                                 120.0 145.000 25.0000  2.9955  1.1345   0.77068   5.22040
drinkany - yes:no                                     1.0   2.000      NA -3.7196  1.6613  -6.97780  -0.46150
physact - about as active:somewhat more active        5.0   1.000      NA -3.8197  2.0286  -7.79800   0.15861
physact - much less active:somewhat more active       5.0   2.000      NA -8.5306  3.9152 -16.20900  -0.85228
physact - much more active:somewhat more active       5.0   3.000      NA -4.0525  2.7260  -9.39850   1.29350
physact - somewhat less active:somewhat more active   5.0   4.000      NA -3.8614  2.5796  -8.92030   1.19760
smoking - yes:no                                      1.0   2.000      NA  6.1923  2.4427   1.40190  10.98300

Adjusted to: bmi=26.9 smoking=no
What do we have now in m3imp, after aregImpute and fit.mult.impute?

glance won't work with an ols fit, but we can just use…
What if we want to look at one of the imputed data sets from aregImpute?

fit3 was our imputation model here, built on the hers1 data, with subject identifiers in subject…
imputed_5 <- impute.transcan(fit3, data = hers1, imputation = 5,
list.out = T, pr = F, check = F)
imputed_df5 <- as.data.frame(do.call(cbind, imputed_5))
fifth_imp <- bind_cols(subject = hers1$subject, imputed_df5) |>
tibble() |> mutate_if(is.character, as.factor) |>
  mutate(subject = as.character(subject))

The fifth_imp tibble

# A tibble: 2,032 × 8
subject ldl age smoking drinkany sbp physact bmi
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 122. 70 1 1 138 3 23.7
2 2 242. 62 1 1 118 2 28.6
3 4 116. 64 2 2 152 2 24.4
4 5 151. 65 1 1 175 4 21.9
5 6 138. 68 1 2 174 1 29.0
6 8 121. 69 1 1 178 3 23.2
7 9 133 61 1 2 162 1 30.3
8 10 220 62 2 2 111 4 45.7
9 11 173. 72 1 1 122 1 22.2
10 12 124. 73 1 1 158 5 25.3
# ℹ 2,022 more rows
[1] 0
lm for the 5th imputation?

model_for_imp5 <-
lm(ldl ~ rcs(bmi, 5) + pol(age, 2) + sbp +
drinkany + physact + smoking +
smoking %ia% bmi, data = fifth_imp)
model_for_imp5
Call:
lm(formula = ldl ~ rcs(bmi, 5) + pol(age, 2) + sbp + drinkany +
physact + smoking + smoking %ia% bmi, data = fifth_imp)
Coefficients:
(Intercept) rcs(bmi, 5)bmi rcs(bmi, 5)bmi' rcs(bmi, 5)bmi''
111.901199 1.032315 -7.262365 33.206787
rcs(bmi, 5)bmi''' pol(age, 2)age pol(age, 2)age^2 sbp
-46.082107 -0.024238 -0.002339 0.125763
drinkany physact smoking smoking %ia% bmi
-3.700570 1.017248 -6.928039 0.500132
Fit summaries for the model in imputation 5

| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.025 | 0.020 | 36.946 | 4.794 | 0.000 | 11 | -10211.66 | 20449.32 | 20522.34 | 2757257 | 2020 | 2032 |
Analysis of Variance Table
Response: ldl
Df Sum Sq Mean Sq F value Pr(>F)
rcs(bmi, 5) 4 27169 6792.2 4.9761 0.0005428 ***
pol(age, 2) 2 8836 4417.8 3.2366 0.0395027 *
sbp 1 12237 12237.3 8.9652 0.0027852 **
drinkany 1 6544 6543.8 4.7941 0.0286721 *
physact 1 5343 5343.4 3.9146 0.0480038 *
smoking 1 10090 10090.5 7.3924 0.0066060 **
smoking %ia% bmi 1 1761 1761.3 1.2903 0.2561224
Residuals 2020 2757257 1365.0
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Logistic Regression: Predicting a Binary Outcome
432 Class 06 | 2024-02-01 | https://thomaselove.github.io/432-2024/